Dog Bites in New York Dataset Exercise

Tijana Blagojev - R-Ladies Belgrade

Aim of the Exercise

  • We will get acquainted with how R works

  • We will learn about different types of variables

  • We will just scratch the surface of several R packages, such as parts of the tidyverse (dplyr and ggplot2)

  • We will create a dashboard with the information contained in the dog bites dataset

First steps

  • After installing R and RStudio, you need to set a working directory where all your work will be stored.

  • The best way to do this is to choose File/New Project, which will automatically store all your information in the same place.

Exercise

R Interface

Packages and Libraries

When you install R, you have basic functions already available within Base R. You can take a look at Introduction to Base R for additional information.

However, in order to access functions or data written by other people, there are numerous R packages available.

An R package is a bundle of functions (code), data, documentation, and vignettes (examples).

Important note - R is case-sensitive so make sure to check spelling and capitalization!

Packages and Libraries-Code

To use the functions in an R package, you first need to install the package (once) and then load it with library() in every new session. Use the following code to install packages and load libraries.
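A minimal sketch of this step, assuming the packages named in this exercise (tidyverse, flexdashboard, DT and plotly) are the ones needed. The install line is commented out so that it runs only when you choose to run it:

```r
# Install once per computer (remove the leading # to run this line).
# install.packages(c("tidyverse", "flexdashboard", "DT", "plotly"))

# Load in every new R session before using the packages.
library(tidyverse)      # includes dplyr and ggplot2
library(flexdashboard)  # dashboard layout for R Markdown
library(DT)             # searchable tables
library(plotly)         # interactive charts
```

Package names are case-sensitive: library(dt) fails, while library(DT) works.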

Simple use of R

Type in your console the following command and press enter.
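A minimal example, assuming the command is a simple arithmetic expression (the exact expression used in the workshop is an assumption here):

```r
# R works as a calculator: type an expression and press Enter.
2 + 2
```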

## [1] 4

You use <- to create objects in R. It is called the assignment operator.
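A short sketch of assignment. The object name total and the numbers are illustrative choices, picked so that the result is 15:

```r
# Store the result of a calculation in an object called total.
total <- 10 + 5

# Typing the object's name prints its value.
total
```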

## [1] 15

Dataset

The dataset on dog bites is taken from the R package nycdogs by Kieran Healy. For our exercise it is adapted to include only the year 2017 and several variables. So let us see what the dataset looks like.

Important note: You will rarely come across a dataset that is already prepared for analysis. Usually, you will spend between 50% and 80% of your time on cleaning and preparing data.

Importing a dataset

First, we will import and inspect a csv file about dog bites in New York City for 2017 with the following code.
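A minimal sketch of importing and inspecting a CSV file with read_csv() and glimpse(). The workshop file is not available here, so the example writes a stand-in file with two rows copied from the output shown in the next section. The object name dogbites and the temporary file are assumptions:

```r
library(readr)
library(dplyr)

# Stand-in file: two rows copied from the inspection output in this exercise.
# In the workshop, you would point read_csv() at the CSV in your project folder.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "date_of_bite,breed,gender,spay_neuter,borough,zip_code",
  "2017-01-02,Labrador Retriever Crossbreed,Male,No,Brooklyn,11231",
  "2017-01-02,Lhasa Apso,Male,Yes,Brooklyn,11211"
), csv_path)

# Import the file; read_csv() guesses the type of each column.
dogbites <- read_csv(csv_path)

# Show the number of rows and columns, and the type of each variable.
glimpse(dogbites)
```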

Inspecting dataset

There are 3072 rows that we will refer to as observations and 6 columns that we will call variables. As you can see, the variables have different types, such as character, date, and double (continuous).

## Observations: 3,072
## Variables: 6
## $ date_of_bite <date> 2017-01-02, 2017-01-02, 2017-01-04, 2017-01-07, 2017-01…
## $ breed        <chr> "Labrador Retriever Crossbreed", "Lhasa Apso", "Pit Bull…
## $ gender       <chr> "Male", "Male", "Unknown", "Unknown", "Male", "Unknown",…
## $ spay_neuter  <chr> "No", "Yes", "No", "No", "Yes", "No", "No", "No", "No", …
## $ borough      <chr> "Brooklyn", "Brooklyn", "Brooklyn", "Brooklyn", "Brookly…
## $ zip_code     <dbl> 11231, 11211, 11219, 11216, 11216, 11229, 11216, 11206, …

Variables

Numeric and Categorical

Numeric can be:

  • Integer: Age, number of kittens

  • Double (Continuous): Height, weight

Categorical:

  • Character: Black, yellow, white

  • Factor (Ordinal): Cold, mild, warm, hot
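The four types above can be sketched in base R as follows. The object names and values are illustrative, taken from the examples in the list:

```r
age         <- 4L        # integer: the L suffix marks a whole number
height      <- 1.75      # double (continuous)
colour      <- "black"   # character
temperature <- factor(   # ordered factor: categories with a fixed order
  "mild",
  levels  = c("cold", "mild", "warm", "hot"),
  ordered = TRUE
)

# class() reports the type of each object.
class(age)
class(height)
class(colour)
class(temperature)
```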

Creating R markdown dashboard presentation

In the top left corner, click the document icon with the plus sign and choose R Markdown. Then choose From Template and open the Flex Dashboard template.
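The same template can also be created from the console. A minimal sketch, assuming the rmarkdown and flexdashboard packages are installed; the file name dashboard.Rmd and the temporary folder are illustrative choices:

```r
# Create a new R Markdown file from the Flex Dashboard template.
dashboard_file <- rmarkdown::draft(
  file.path(tempdir(), "dashboard.Rmd"),  # illustrative location and name
  template = "flex_dashboard",
  package  = "flexdashboard",
  edit     = FALSE                         # do not open the file in an editor
)

# Rendering the file to HTML is what the Knit button does.
# rmarkdown::render(dashboard_file)
```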

Creating R markdown dashboard presentation

Flexdashboard Template

Setting up the Appearance of Flexdashboard

Pipe operator

The tidyverse provides the so-called “pipe” operator %>%. It passes the result of the left-hand side as the first argument of the function on the right-hand side. It is used to chain multiple operations on data together.
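A small sketch of the pipe, assuming dplyr is loaded (it makes %>% available). The vector values is an illustrative example:

```r
library(dplyr)

values <- c(1, 4, 9)

# Without the pipe: functions are nested and read from the inside out.
nested <- sum(sqrt(values))

# With the pipe: the same steps read from left to right.
piped <- values %>%
  sqrt() %>%   # square root of each value: 1, 2, 3
  sum()        # add them up

piped
```

Both versions give the same result; the piped one is easier to read when there are many steps.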

Setup part of the R-markdown-Dashboard Code

In the Setup part of the code, we will import the dog bites dataset and create a subset with the number of bites per borough, which we will use in the textual part of our dashboard.

Number of Bites per Borough in New York

Now let us take a look at the 5 boroughs with the highest number of bites
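One way to build this summary with dplyr is sketched here. The data is a stand-in rebuilt from the counts in the table, and "Other" is an assumed placeholder for the 112 of the 3072 rows that do not belong to the top five boroughs. The object names are assumptions:

```r
library(dplyr)

# Stand-in data rebuilt from the counts in the table; "Other" is an assumed
# label for the rows whose borough is not shown in the top five.
dogbites <- tibble(
  borough = rep(
    c("Queens", "Brooklyn", "Manhattan", "Bronx", "Staten Island", "Other"),
    times = c(817, 690, 663, 506, 284, 112)
  )
)

bites_by_borough <- dogbites %>%
  count(borough, sort = TRUE) %>%             # rows per borough, largest first
  mutate(perc = round(n / sum(n) * 100)) %>%  # share of all bites, in percent
  head(5)                                     # keep the top five

bites_by_borough
```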

## # A tibble: 5 x 3
##   borough           n  perc
##   <chr>         <int> <dbl>
## 1 Queens          817    27
## 2 Brooklyn        690    22
## 3 Manhattan       663    22
## 4 Bronx           506    16
## 5 Staten Island   284     9

Textual part of the dashboard

We will use inline R code: a backtick `, followed by r and some function, closed with another backtick. It automatically inserts a value into the text, so if we use a subset for another year, the text updates straight away. To access a particular value in a dataset, you can use the following code, where the first number is the row number and the second is the column number.
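A minimal sketch of picking one value by row and column. The name bites_by_borough is an assumption, and the two rows are copied from the borough table above:

```r
library(dplyr)

# Stand-in for the borough summary (first two rows of the table above).
bites_by_borough <- tibble(
  borough = c("Queens", "Brooklyn"),
  n       = c(817L, 690L),
  perc    = c(27, 22)
)

# [row, column]: first row, first column.
top_borough <- bites_by_borough[1, 1]
top_borough

# In the dashboard text, the same expression goes between backticks:
#   The most bites happened in `r bites_by_borough[1, 1]`.
```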

## # A tibble: 1 x 1
##   borough
##   <chr>  
## 1 Queens

Textual part of the dashboard-Code

Textual part of the dashboard result

Congratulations, you just coded and knitted your first dashboard!!!

Creating a Searchable Datatable

First, in the Setup part of our dashboard document, we will create a table without the last column, which contains the zip codes.

Now, with the help of the DT package, we will add a searchable table in the second row of the first column, under a heading marked with ###.
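Both steps can be sketched as follows, assuming dplyr and DT are installed. The two rows are copied from the inspection output earlier in this exercise, and the object names are assumptions:

```r
library(dplyr)
library(DT)

# Stand-in data: two rows copied from the inspection output.
dogbites <- tibble(
  date_of_bite = as.Date(c("2017-01-02", "2017-01-02")),
  breed        = c("Labrador Retriever Crossbreed", "Lhasa Apso"),
  gender       = c("Male", "Male"),
  spay_neuter  = c("No", "Yes"),
  borough      = c("Brooklyn", "Brooklyn"),
  zip_code     = c(11231, 11211)
)

# Setup part: drop the zip code column.
dogs_table <- dogbites %>%
  select(-zip_code)

# Dashboard part: datatable() adds a search box and sortable columns.
bites_widget <- datatable(dogs_table, rownames = FALSE)
bites_widget
```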

Dashboard progress

Creating a Bar Chart

First, we will create a subset to see which three breeds bite most often. We will again put this code in the Setup part of our R Markdown dashboard file. We will also change the breed variable from character to factor.
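A minimal sketch of this subset. The breed names and counts below are toy values for illustration only, not the real data, and the object names are assumptions:

```r
library(dplyr)

# Toy data: placeholder breed names, NOT the real counts.
toy_bites <- tibble(
  breed = c("Breed A", "Breed A", "Breed A",
            "Breed B", "Breed B",
            "Breed C", "Breed C", "Breed C", "Breed C",
            "Breed D")
)

top_breeds <- toy_bites %>%
  count(breed, sort = TRUE) %>%                 # bites per breed, largest first
  head(3) %>%                                   # keep the top three
  mutate(breed = factor(breed, levels = breed)) # character -> factor, keeping this order

top_breeds
```

Turning breed into a factor with levels in this order keeps the bars sorted by count in the chart instead of alphabetically.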

Using ggplot and plotly

We will use two packages: one (ggplot2) to make a bar graph and another (plotly) to make the graph’s information pop up on hover. ggplot2 is a package created by Hadley Wickham that is based on the grammar of graphics.

Grammar of Graphics

It enables you to specify the building blocks of a plot and combine them to create the graphical display you want.

  • data

  • aesthetic mapping

  • geometric object

  • statistical transformations

  • scales

  • coordinate system

  • position adjustments

  • faceting

Creating bar graph

Instead of Chart B we will write: Three breeds with highest number of bites in 2017, and use this code for the bar chart.
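A sketch of such a bar chart, assuming ggplot2 and plotly are installed. The data is a toy stand-in with placeholder breed names and counts, and the colour and axis labels are illustrative choices:

```r
library(ggplot2)
library(plotly)

# Toy stand-in for the top three breeds (placeholder names and counts).
top_breeds <- data.frame(
  breed = factor(c("Breed C", "Breed A", "Breed B"),
                 levels = c("Breed C", "Breed A", "Breed B")),
  n     = c(4L, 3L, 2L)
)

bar_plot <- ggplot(top_breeds, aes(x = breed, y = n)) +  # data + aesthetic mapping
  geom_col(fill = "steelblue") +                          # geometric object: bars
  labs(x = "Breed", y = "Number of bites") +              # axis labels
  theme_minimal()

# ggplotly() turns the static plot into one with hover pop-ups.
interactive_bar <- ggplotly(bar_plot)
interactive_bar
```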

Bar Chart

Dashboard Progress

Final Stage - Braaavoo!!!!

Stacked Bar of Spayed/Neutered Dogs

In this final part, we will create a stacked bar chart showing how many of the dogs that bit were spayed/neutered and how many were male or female. In the Setup part, we will again create a subset, this time grouped by spay/neuter and gender. We will also create another column to use as a pop-up label.
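One way to build this subset is sketched here. The raw rows are rebuilt from the six counts in the datadogsgenderspay table, which add up to 3072. The name dogbites is an assumption:

```r
library(dplyr)

# Stand-in raw data rebuilt from the six counts in the subset table.
cell_counts <- c(271, 682, 1063, 290, 755, 11)
dogbites <- tibble(
  spay_neuter = rep(c("No", "No", "No", "Yes", "Yes", "Yes"), times = cell_counts),
  gender      = rep(c("Female", "Male", "Unknown", "Female", "Male", "Unknown"),
                    times = cell_counts)
)

datadogsgenderspay <- dogbites %>%
  count(spay_neuter, gender) %>%   # bites per spay/neuter and gender combination
  mutate(Info = paste0(            # HTML text for the hover pop-up
    "<br> Spay/Neuter: ", spay_neuter,
    " <br> Number of bites: ", n,
    " <br> Gender: ", gender, " <br>"
  ))

datadogsgenderspay
```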

The datadogsgenderspay subset

spay_neuter  gender   n     Info
No           Female   271   <br> Spay/Neuter: No <br> Number of bites: 271 <br> Gender: Female <br>
No           Male     682   <br> Spay/Neuter: No <br> Number of bites: 682 <br> Gender: Male <br>
No           Unknown  1063  <br> Spay/Neuter: No <br> Number of bites: 1063 <br> Gender: Unknown <br>
Yes          Female   290   <br> Spay/Neuter: Yes <br> Number of bites: 290 <br> Gender: Female <br>
Yes          Male     755   <br> Spay/Neuter: Yes <br> Number of bites: 755 <br> Gender: Male <br>
Yes          Unknown  11    <br> Spay/Neuter: Yes <br> Number of bites: 11 <br> Gender: Unknown <br>

Creating stacked bar graph

Instead of Chart C we will write and center the title: Bites based on dog’s gender and whether they were spayed/neutered {align=center}, and use this code for the stacked bar:
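A sketch of such a stacked bar, assuming ggplot2 and plotly are installed. The counts are copied from the datadogsgenderspay table; the axis labels are illustrative choices:

```r
library(ggplot2)
library(plotly)

# Counts copied from the datadogsgenderspay table.
datadogsgenderspay <- data.frame(
  spay_neuter = c("No", "No", "No", "Yes", "Yes", "Yes"),
  gender      = c("Female", "Male", "Unknown", "Female", "Male", "Unknown"),
  n           = c(271, 682, 1063, 290, 755, 11)
)

# Pop-up label, in the same format as the Info column of the table.
datadogsgenderspay$Info <- paste0(
  "<br> Spay/Neuter: ", datadogsgenderspay$spay_neuter,
  " <br> Number of bites: ", datadogsgenderspay$n,
  " <br> Gender: ", datadogsgenderspay$gender, " <br>"
)

stacked_plot <- ggplot(
  datadogsgenderspay,
  aes(x = spay_neuter, y = n, fill = gender, text = Info)  # text feeds the pop-up
) +
  geom_col() +   # bars with the same x value are stacked by default
  labs(x = "Spayed/Neutered", y = "Number of bites", fill = "Gender")

# tooltip = "text" shows only the Info label on hover.
interactive_stacked <- ggplotly(stacked_plot, tooltip = "text")
interactive_stacked
```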

Stacked Bar

Dashboard Completed

Word of Caution in this Tale

  • “We infer that something we see in the data applies beyond the time, place and conditions in which it happened to surface.” Ben Jones Avoiding Data Pitfalls.

  • In order to say that Pit Bulls are really aggressive, we need to do additional research.

  • Is it relevant to make conclusions with this number of observations (is the data reliable)?

  • That is why experts need to be able to create this type of visualisation: they already have the expertise needed to draw conclusions, and this tool can help them reach wider audiences.

Great Work and Thank you!